ABSTRACT: Multiplier-accumulator (MAC) units are an essential component of DSP systems. They play a vital role in high-speed digital signal processing (DSP), image processing and fast Fourier transform (FFT) computation. These computations require a large number of multiplication and addition operations, which call for dedicated MAC and Arithmetic and Logic Unit (ALU) architectures. Multipliers and adders are the key components of these arithmetic units, as they determine the overall performance of the system, i.e. speed, power and area consumed. This paper reviews the components of MAC circuits, analyzing and comparing their performance in terms of area, speed and power consumption. The Carry Look Ahead adder is the most suitable for low power and high performance, whereas the Wallace tree multiplier can be implemented for chips that have larger area and require high speed. Various other multiplier and adder circuits are also reviewed.

1. INTRODUCTION

A Multiplier Accumulator (MAC) unit multiplies two numbers and accumulates the result in a register, repeating this operation to carry out continuous and complex computations. A MAC can therefore speed up computation. It has numerous applications in digital signal processing, including filtering and convolution. MAC units are also widely used in audio and video signal processing, Artificial Intelligence (AI), machine learning, and military and defense systems [1]. Since these operations require a cyclic application of multiplication and addition, the speed of execution depends on the overall performance of the MAC unit [2]. Using a MAC unit improves accuracy and reduces the time delay for computing dot products, matrix multiplications, artificial neural networks, and various other mathematical computations.

The demand for fast and portable electronic devices has been rising, as they help in accomplishing our day-to-day tasks. The computation speed of a processor is highly dependent on its arithmetic units. The complexity of the arithmetic unit increases as the speed and power performance of the processor improve. Therefore, it is essential to design the unit so as to reduce its complexity, based on the algorithm and the number of components used.

The MAC, being the fundamental unit of DSP, significantly affects the performance of the system. A MAC unit comprises a multiplier and an adder. The design parameters of any architecture depend on these basic building blocks, so enhancing them can improve the performance of the overall unit. Accordingly, improving the speed and area of the multiplier is a notable challenge for system designers. This challenge can be addressed by using suitable multiplier techniques and an appropriate adder circuit. The objective of this paper is to design a MAC unit that performs computation with optimum speed, low power dissipation and small chip area.


The remainder of this paper is organized as follows. Section 2 explains the basic operation of the MAC unit and illustrates an application of MAC. Section 3 explains the design and performance requirements of the MAC unit. Section 4 presents various adder circuits. Section 5 presents various multiplier circuits.


Rakesh and Sunitha have implemented a novel design of a 32-bit MAC unit with a 32-bit adder unit using a Weinberger adder and a Vedic multiplier based on the Urdhva Sutra [1]. This model is efficient in terms of low delay and power consumption. The authors mentioned that there is scope to improve the area constraint. Rajesh and Reddy have compared various adder circuits, namely the ripple carry adder (RCA), carry save adder (CSA), Kogge-Stone adder (KSA) and DKG gate adder, using simulations carried out with Xilinx ISE 14.7. They concluded that a DKG gate adder combined with a Vedic multiplier using reversible computing significantly reduces the total time delay [2]. Ozcan and Erdem have implemented a Barrett multiplier using a fast carry select adder, which uses less hardware [3]. Zhou et al. have implemented an FPGA-based high-speed floating-point MAC which improves the performance of computations such as matrix multiplication and matrix-vector multiplication [4].


Kamble and Ugale have compared various multipliers and concluded that the Wallace tree algorithm with a Booth recoder is recommended for fast computations, with the drawback of larger area, whereas the shift and add multiplication algorithm can be used for chips having less area [5]. Lad and Bendre have analyzed various Vedic multipliers in terms of performance parameters such as delay, power and area [6]. They concluded that the Ekadhikena Purvena sutra can reduce the area by 70.26% and increase the speed by 90.41% compared to the Urdhva Tiryakbhyam sutra, and that the Nikhilam sutra gives optimum results in terms of area, delay and power. Lachireddy and Ramesh have compared ALU designs built using a Vedic multiplier and a Booth multiplier, with the Vedic multiplier having better performance in terms of area and power [7]. Chanfralekha et al. have designed and implemented 8-bit and 16-bit ALUs to perform the operations of sum, NOT and NOR [8]. The reversible gates reduce the use and loss of data bits and also improve power utilization.


The most vital mathematical operation in DSP applications is repeated multiplication, which can be effectively implemented using a MAC. The objective of this paper is to design a MAC that gives an optimum result in terms of area, speed and power consumption. Various multipliers and adders are analyzed and their performance parameters are compared.

2. MAC UNIT AND ITS APPLICATION

The basic MAC unit comprises a multiplier, an adder and an accumulator. The basic MAC operation is A = A + x*y, where A is the accumulator and x and y are the operands. The adder adds the product from the multiplier to the current accumulator value and stores the sum back in the accumulator; the multiplier output thus reaches the accumulator through the adder. If inputs of bit size N are given to the multiplier, the product is 2N bits wide, so the adder should be of bit size 2N, giving an output of bit size 2N+1. The block diagram of the MAC unit is shown in Fig. 1.
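The bit-width argument above can be checked numerically. The following is a minimal sketch, assuming unsigned operands; the function name `mac_step` and the parameter `n_bits` are illustrative and do not come from the paper.

```python
def mac_step(acc: int, x: int, y: int, n_bits: int) -> int:
    """One multiply-accumulate step, A = A + x*y, on unsigned n_bits operands."""
    limit = 1 << n_bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("operands must fit in n_bits")
    product = x * y  # the product of two N-bit values always fits in 2N bits
    assert product < (1 << (2 * n_bits))
    return acc + product


N = 8
# Largest possible product of two 8-bit operands.
worst_product = mac_step(0, 2**N - 1, 2**N - 1, N)
# Largest 2N-bit accumulator value plus the largest product.
worst_sum = mac_step(2**(2 * N) - 1, 2**N - 1, 2**N - 1, N)

print(worst_product.bit_length())  # 16, i.e. 2N
print(worst_sum.bit_length())      # 17, i.e. 2N+1
```

Repeated accumulation would need further guard bits beyond 2N+1; the sketch only covers a single addition, as in the text.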


Fig. 1 MAC Unit Block Diagram [11] (blocks shown include MEMORY and MULTIPLIER, Z = X*Y)


The MAC is an integral part of DSP. A MAC unit performs the elementary computations of many Digital Signal Processing (DSP) applications and algorithms involving repeated multiplication and accumulation, and it improves the overall speed of DSP systems. Such applications include convolution, filtering and inner products. The Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT) are transforms commonly used in DSP methods. Since they are fundamentally accomplished by cyclic application of multiplication and addition, the speed of the multiplication and addition arithmetic determines the execution time and performance of the overall calculation. Multiply-accumulate operations are characteristic of digital filters. MAC operations are also required for Fast Fourier Transforms [2]. Fig. 2 illustrates an application of the MAC unit in matrix multiplication and dot product. The X and Y data are read from memory concurrently, so that both operands are fetched at the same time. A, B and C are the products of corresponding X and Y data, which are accumulated in the register after each operation using an adder [14].
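As an illustration of the dot product and matrix multiplication use case, the sketch below expresses both as repeated MAC operations. It is a software model only; the helper names `dot_mac` and `matmul_mac` are hypothetical.

```python
def dot_mac(xs, ys):
    """Dot product computed as a chain of MAC operations: acc = acc + x*y."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    acc = 0  # accumulator register, cleared before the first operation
    for x, y in zip(xs, ys):  # X and Y operands are fetched together
        acc = acc + x * y
    return acc


def matmul_mac(a, b):
    """Matrix product where every output element is one MAC-based dot product."""
    columns = list(zip(*b))
    return [[dot_mac(row, col) for col in columns] for row in a]


print(dot_mac([1, 2, 3], [4, 5, 6]))                  # 32
print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```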


3. DESIGN OF MAC UNITS 


The overall performance of the MAC unit depends on factors such as time delay, power and area. Different multiplier and adder circuits are combined based on the requirements of the system.


Adder: Commonly used adders in the design of processing devices are the carry save adder, carry select adder, ripple carry adder (RCA) and carry look-ahead adder (CLA). Propagation delay and critical path are two important factors that decide the performance of adders.


Fig. 2 Application of MAC in dot product [14] 


Accumulator: The accumulator is a register. The accumulator stage consists of input registers, one of which is the accumulator register, an arithmetic unit and a shift register. The shift register uses D flip-flops configured as Parallel In Parallel Out (PIPO). The arithmetic unit operates on the values stored in the input registers, and the result is then transferred to the accumulator through the shift register. Because the adder output is wide and is generated in parallel, the PIPO register receives the adder output bits in parallel and produces the corresponding output bits in parallel as well. The accumulator register output is fed back as one of the inputs to the adder [9]. With PIPO shift registers, all the bits are transferred in one clock cycle, reducing the time delay.


Multipliers: They play an important role in modern digital signal processing systems and various other applications. Using a highly efficient multiplier can increase the overall performance of the system. It is desirable to have a multiplier with the following characteristics.

• It must be highly accurate.
• It should be able to perform computations at a very high speed and with reduced time delays.
• It should have optimum area.
• It should consume less power.
4. ADDER 


The computation hardware is the most important component in determining the functioning of the entire system. As such, optimizing the performance of the adder circuit greatly improves the execution of computational operations, and improving the architecture of the adder circuit is therefore a key concern [10]. Commonly used adders are the carry save adder, carry select adder, ripple carry adder (RCA) and carry look-ahead adder (CLA), among which the carry select adder is one of the most widely used circuits for implementing fast arithmetic computations.


Carry Skip Adder: The carry skip adder generates carry propagate signals Pi using XOR gates, which help to shorten carry propagation. These Pi signals are given as inputs to an AND gate, and the output of the AND gate is used as the select bit of a multiplexer that provides Cout. When every Pi in a block is 1, the multiplexer passes the block's carry input directly to Cout, so the carry skips the block. This reduces the critical path and improves on the delay of the ripple carry adder, as illustrated in Fig. 3. Its power consumption and combined path delay are higher than those of the other three adders [12].
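The propagate, AND and multiplexer structure described above can be modelled at the bit level. This is a behavioural sketch of a single block, not the circuit of Fig. 3; the function name `carry_skip_block` and its `width` parameter are assumptions made for illustration.

```python
def carry_skip_block(a: int, b: int, cin: int, width: int = 4):
    """Return (sum, cout) for one carry skip block of `width` bits."""
    carry = cin
    total = 0
    block_propagate = 1
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi                      # propagate signal Pi from an XOR gate
        total |= (p ^ carry) << i        # sum bit of this position
        carry = (ai & bi) | (p & carry)  # ripple carry inside the block
        block_propagate &= p             # AND of all Pi signals
    # Multiplexer: if every bit propagates, Cout is taken directly from Cin.
    cout = cin if block_propagate else carry
    return total, cout


print(carry_skip_block(0b0101, 0b1010, 1))  # (0, 1): all Pi are 1, carry skips
```

In software both multiplexer inputs give the same value; the benefit in hardware is that the skip path does not wait for the ripple chain.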


Fig. 3. 4-bit Carry Skip Adder [12]
Carry Save Adder (CSA): Functionally, each cell of a CSA is identical to a full adder. Adder design is significantly affected by the critical delay in the carry chain path, so the adder design should have a short carry chain. CSAs are usually used in multiplier circuits. Multiple ripple carry adders run concurrently, which provides faster results. Each stage consists of three-input full adders with sum and carry as outputs. The carry propagates diagonally through the array of adder cells, as shown in Fig. 4.
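A carry save stage can be sketched as a row of independent full adders that turns three operands into a sum word and a carry word. The helper names `carry_save_add` and `add_many` below are hypothetical, and the final carry-propagate addition is simply delegated to Python's integer addition.

```python
def carry_save_add(a: int, b: int, c: int):
    """Reduce three operands to (sum_word, carry_word) with no carry chain."""
    sum_word = a ^ b ^ c                            # per-bit full adder sum
    carry_word = ((a & b) | (b & c) | (a & c)) << 1  # per-bit carry, shifted left
    return sum_word, carry_word


def add_many(operands):
    """Add several non-negative integers using carry save stages."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)  # one final carry-propagate addition


print(carry_save_add(5, 9, 14))  # (2, 26), and 2 + 26 == 5 + 9 + 14
print(add_many([5, 9, 14, 3]))   # 31
```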


Carry Select Adder: This adder comes under the category of conditional sum adders. Sums and carries are calculated in advance for both possible values of the input carry, one and zero, before the actual input carry arrives. Based on the carry input, a multiplexer then selects the required carry and sum outputs. For the most significant bits, one adding unit performs the addition assuming a carry input of 1, while the other assumes a carry input of 0. The actual sum and carry are selected by the carry out calculated in the previous stage. This adder may have a lower path delay, but it uses more power. By comparison, the CLA adder has a lower power-delay product than the other adders. The carry select adder is shown in Fig. 5.

Fig. 4. 4-bit Carry Save Adder

Fig. 5. Carry Select Adder
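The select mechanism can be sketched as follows: the upper block is added twice, once for each assumed carry, and the carry out of the lower block picks the result. The split into one lower and one upper block, and the names `ripple_add` and `carry_select_add`, are assumptions for illustration rather than the structure of Fig. 5.

```python
def ripple_add(a: int, b: int, cin: int, width: int):
    """Plain adder for one block: returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, cin: int = 0, width: int = 8, block: int = 4):
    """Add two unsigned `width`-bit numbers with one carry select stage."""
    low_mask = (1 << block) - 1
    hi_width = width - block
    a_lo, b_lo = a & low_mask, b & low_mask
    a_hi, b_hi = a >> block, b >> block

    s_lo, c_lo = ripple_add(a_lo, b_lo, cin, block)
    # Both candidate results for the upper block are computed ahead of time.
    s_hi0, c_hi0 = ripple_add(a_hi, b_hi, 0, hi_width)
    s_hi1, c_hi1 = ripple_add(a_hi, b_hi, 1, hi_width)
    # Multiplexer: the lower block's carry out selects the valid candidate.
    s_hi, cout = (s_hi1, c_hi1) if c_lo else (s_hi0, c_hi0)
    return (s_hi << block) | s_lo, cout


print(carry_select_add(200, 100))  # (44, 1), since 300 = 256 + 44
```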


Carry Look Ahead Adder (CLA): The CLA adder, as illustrated in Fig. 6, is one of the high-speed adders used for the addition of two numbers. The CLA uses modified full adders to calculate each bit position. The look-ahead modules provide separate group carry generate (Gi) and group carry propagate (Pi) outputs. Gi indicates that a carry is generated within the group, whereas Pi indicates that an incoming carry would propagate across the group. The CLA consists of two logic blocks, the Partial Full Adder (PFA) and the Carry Look Ahead logic (CLA logic). Each PFA has two outputs, Pi and Gi, which are given to the CLA logic, and one carry input Ci from the CLA logic [12]. Compared to the other adders, it performs better in terms of power consumption and gate count, resulting in a reduction in the area utilized. As bit width increases, the propagation delay incurred in both the ripple carry adder (RCA) and the CLA adder also increases.

Fig. 6. 4-bit Carry Look Ahead Adder [12]
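A 4-bit version of the PFA and look-ahead logic can be written out directly from the standard generate and propagate equations, Gi = Ai AND Bi and Pi = Ai XOR Bi. The sketch below is a behavioural model under those textbook definitions; the function name `cla_4bit` is illustrative.

```python
def cla_4bit(a: int, b: int, c0: int = 0):
    """4-bit carry look-ahead addition: returns (sum, carry_out)."""
    # Partial full adders: one generate and one propagate signal per bit.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]

    # Look-ahead logic: every carry is a two-level function of g, p and c0,
    # so no carry has to wait for the previous one to ripple through.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))  # Si = Pi XOR Ci
    return total, c4


print(cla_4bit(0b1011, 0b0110))  # (1, 1), since 11 + 6 = 17 = 16 + 1
```

The growing number of terms in c1 to c4 shows why the look-ahead logic becomes larger as the bit width increases.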


5. MULTIPLIER 


Multipliers are an essential part of any system. The multiplier speed determines the performance of the processor, so a better performing multiplier should improve the speed of computations. The arrangement of the components determines the performance of the various multiplier circuits available. The multiplier architecture is selected based on the application requirements.


Booth Multiplier: Booth's multiplication algorithm was invented by Andrew Donald Booth in the year 1950. The Booth multiplier outperforms conventional methods of multiplication by reducing the number of iteration steps. The algorithm multiplies two binary numbers in two's complement notation and produces a result that preserves the sign. It examines the multiplier bits, adds or subtracts the multiplicand as the algorithm dictates, and shifts the partial product. It speeds up multiplication by skipping over runs of identical multiplier bits, thus reducing the number of additions and subtractions required to produce the result. The flowchart of the Booth multiplier is shown in Fig. 7 [9].
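The examine, add or subtract, and shift loop can be sketched in software using the usual register names A, Q, Q-1 and B. This is a radix-2 model under one stated simplification: A is kept as an unbounded Python integer, which plays the role of an A register with guard bits so that the most negative multiplicand does not overflow. The function name `booth_multiply` is hypothetical.

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Radix-2 Booth multiplication of two signed n-bit integers."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= multiplicand <= hi and lo <= multiplier <= hi):
        raise ValueError("operands must fit in n-bit two's complement")

    b = multiplicand                 # B register (signed value)
    a = 0                            # A register, signed, with guard bits
    q = multiplier & ((1 << n) - 1)  # Q register, n-bit two's complement pattern
    q_1 = 0                          # extra bit Q-1
    for _ in range(n):               # COUNT = n iterations
        pair = (q & 1, q_1)
        if pair == (1, 0):
            a -= b                   # start of a run of 1s: A = A - B
        elif pair == (0, 1):
            a += b                   # end of a run of 1s: A = A + B
        # Arithmetic right shift of the combined A, Q, Q-1 register.
        q_1 = q & 1
        q = (q >> 1) | ((a & 1) << (n - 1))
        a >>= 1                      # Python's >> keeps the sign of A
    return (a << n) | q              # signed 2n-bit product held in A:Q


print(booth_multiply(3, -4, 4))   # -12
print(booth_multiply(-8, 1, 4))   # -8
```

Pairs (0, 0) and (1, 1) cause only a shift, which is how runs of identical multiplier bits avoid additions and subtractions.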


Wallace tree multiplier: It is an efficient hardware circuit designed to achieve higher speeds of operation. It reduces the number of partial products and uses carry save adders (CSA) for the addition of partial products. This results in faster computations, as the total delay is proportional to the logarithm of the operand length. The Wallace multiplier is superior to other multipliers with respect to speed and low power consumption. Fig. 8 illustrates a 4-bit Wallace tree block diagram.


The Wallace tree method of multiplication has three steps. First, bitwise multiplication is performed on the inputs to form the partial products. Then, layers of half and full adders reduce the number of partial-product rows to two. Finally, the two remaining rows are grouped together and added.
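The three steps can be sketched with a simplified column-compression model: bits of equal weight are collected in columns, full and half adders are applied layer by layer, and a final addition combines the two remaining rows. This is an illustrative reduction in the spirit of the Wallace tree and does not reproduce the exact adder placement of Fig. 8; the name `wallace_multiply` is an assumption.

```python
def wallace_multiply(a: int, b: int, n: int = 4) -> int:
    """Multiply two unsigned n-bit numbers by column compression."""
    # Step 1: bitwise AND partial products, grouped by column weight.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: layers of full and half adders until every column has <= 2 bits.
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for w, bits in enumerate(cols):
            k = 0
            while len(bits) - k >= 3:                # full adder: 3 bits -> 2
                x, y, z = bits[k:k + 3]
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
                k += 3
            if len(bits) - k == 2:                   # half adder: 2 bits -> 2
                x, y = bits[k:k + 2]
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            elif len(bits) - k == 1:                 # single bit passes through
                nxt[w].append(bits[k])
        cols = nxt

    # Step 3: the two remaining rows are added with a conventional adder.
    row0 = sum(c[0] << w for w, c in enumerate(cols) if len(c) > 0)
    row1 = sum(c[1] << w for w, c in enumerate(cols) if len(c) > 1)
    return row0 + row1


print(wallace_multiply(13, 11))  # 143
```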


Fig. 7 Flow Chart of Booth Multiplier [9] (flowchart labels: A ← 0, Q-1 ← 0, B ← MULTIPLICAND, Q ← MULTIPLIER, COUNT ← n; ARITHMETIC RIGHT SHIFT: A, Q, Q-1; COUNT ← COUNT - 1)


Fig. 8 4-bit Wallace Tree Multiplier block diagram [9]

Array multiplier: It is an efficient layout of a combinational multiplier. The array multiplier is based on the add and shift algorithm. Two binary numbers are multiplied using an array of half and full adders. The partial products are formed by checking each bit of the multiplier, and the shifting and addition of the partial products are done together in the adder array. The structure of this multiplier is systematic and regular. However, compared with other multipliers, it consumes more power and suffers from large delays. The array multiplier is not economical, as it has higher power consumption and requires a larger number of gates, which also increases the area. Although its structure is regular, it has more hardware complexity [13]. Table 1a illustrates the array multiplier and Table 1b shows the comparison of various multiplier circuits.
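The add and shift structure can be sketched by forming one shifted partial-product row per multiplier bit and adding the rows one after another, as each adder row of the array would. The helper names `partial_product_rows` and `array_multiply` are hypothetical, and operands are assumed to be unsigned.

```python
def partial_product_rows(a: int, b: int, n: int = 4):
    """Row i is (bit i of b) x a, shifted left by i positions."""
    return [(a if (b >> i) & 1 else 0) << i for i in range(n)]


def array_multiply(a: int, b: int, n: int = 4) -> int:
    """Unsigned n-bit multiplication as a chain of row additions."""
    acc = 0
    for row in partial_product_rows(a, b, n):
        acc = acc + row  # each row of adders adds one shifted partial product
    return acc


print(partial_product_rows(0b1101, 0b1011))  # [13, 26, 0, 104]
print(array_multiply(0b1101, 0b1011))        # 143
```

Because each row addition waits for the previous one, the delay grows with the number of rows, in line with the delay drawback noted above.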


Table 1a Array Multiplier [13]

                                A3      A2      A1      A0      INPUTS
                        x       B3      B2      B1      B0
                                B0xA3   B0xA2   B0xA1   B0xA0
                +       B1xA3   B1xA2   B1xA1   B1xA0
                c       sum     sum     sum     sum             INTERNAL SIGNALS
        +       B2xA3   B2xA2   B2xA1   B2xA0
        c       sum     sum     sum     sum
+       B3xA3   B3xA2   B3xA1   B3xA0
Y7      Y6      Y5      Y4      Y3      Y2      Y1      Y0      OUTPUTS


Table 1b Comparison of multiplier circuits [13]

Parameter             Array Multiplier    Wallace Tree Multiplier    Booth's Multiplier
Speed                 Less                High                       Highest
Time Delay            More, (n+1)tFA      Log(n)                     Less, (ntFA/2 + ntFA)
Area                  Maximum             Medium                     Minimum
Complexity            Less                More                       Most
Power consumption     Most                More                       Less
FPGA implementation   Less efficient      Not efficient              Most efficient


Vedic multiplier: There are 16 sutras in total in Vedic mathematics, of which the most efficient one for multiplication is the Urdhva Tiryakbhyam sutra. It is the fastest sutra, and a multiplier based on it is commonly referred to as a UT multiplier. In a UT multiplier, multiplication is performed by generating the partial products and adding them simultaneously. This parallel operation reduces the delay. Hence UT multiplier circuits are implemented to achieve high-performance digital circuits. Large multipliers are implemented by using multiple 2x2 Vedic multipliers and adding their products using an adder. Further reduction of power consumption can be achieved using reversible logic gates, which have low power consumption and low heat dissipation. Peres, Feynman and DKG gates are some examples of reversible logic gates [9]. Fig. 10 shows the line diagram for multiplication of two 4-bit numbers based on the Urdhva Tiryakbhyam sutra.
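The composition of larger multipliers from 2x2 blocks can be sketched as below. The 2x2 block follows the vertical and crosswise pattern at gate level, and the wider multiplier combines four half-width products with shifts and additions. The names `ut_2x2` and `ut_multiply`, and the restriction to unsigned operands whose width is a power of two, are assumptions made for this illustration.

```python
def ut_2x2(a: int, b: int) -> int:
    """2x2 vertical-and-crosswise multiplier built from AND gates and half adders."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    q0 = a0 & b0                        # vertical product of the low bits
    cross0, cross1 = a1 & b0, a0 & b1   # crosswise products
    q1 = cross0 ^ cross1                # half adder sum
    c1 = cross0 & cross1                # half adder carry
    top = a1 & b1                       # vertical product of the high bits
    q2 = top ^ c1
    q3 = top & c1
    return q0 | (q1 << 1) | (q2 << 2) | (q3 << 3)


def ut_multiply(a: int, b: int, n: int) -> int:
    """n-bit unsigned multiplier (n a power of two, n >= 2) from 2x2 blocks."""
    if n == 2:
        return ut_2x2(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h
    # The four half-width products are independent and can run in parallel.
    ll = ut_multiply(a_lo, b_lo, h)
    lh = ut_multiply(a_lo, b_hi, h)
    hl = ut_multiply(a_hi, b_lo, h)
    hh = ut_multiply(a_hi, b_hi, h)
    return ll + ((lh + hl) << h) + (hh << (2 * h))


print(ut_2x2(3, 3))            # 9
print(ut_multiply(13, 11, 4))  # 143
```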


6. CONCLUSION 


The performance of the MAC unit plays an important role in DSP-related applications such as correlation, convolution, digital filter design and digital image processing. The design parameters of any architecture depend on the basic building blocks, which are the multiplier and the adder.


The Carry Look Ahead adder is the most suitable for low power and high performance. Carry select and carry save adders are fast, with a drawback in terms of area requirement. Ripple carry and carry skip adders are simple, consume less power and have the lowest gate count. Chips with limited area can use the shift and add multiplication algorithm, which has fewer logic elements, whereas the Wallace tree multiplier can be implemented for chips having larger area. The array multiplier is very slow compared to other algorithms. Depending on the application, one can prefer a fast algorithm such as Vedic or Wallace, or a low-resource algorithm such as shift and add. The specific multiplier and adder architectures are chosen based on the application.
Fig. 10 Line diagram based on UT Multiplier